Journal of Computational Biology
SAGE Publications
Preprints posted in the last 90 days, ranked by how well they match Journal of Computational Biology's content profile, based on 37 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit.
Koshkarov, A.; Tahiri, N.
Phylogenetic trees represent the evolutionary histories of taxa and support tasks such as clustering and Tree of Life reconstruction. Many established comparison methods, including the Robinson-Foulds (RF) distance, assume identical taxon sets. A methodological gap remains for trees with distinct but overlapping taxa. Existing approaches either prune non-common leaves, which can discard information, or complete both trees such that they share the same taxa. Completion is more comprehensive, but current methods typically ignore branch lengths, which are essential for identifying evolutionary patterns. This paper introduces k-Nearest Common Leaves (k-NCL), an algorithm for completing rooted phylogenetic trees defined on different but overlapping taxa. The method uses branch lengths and topological characteristics and does not rely on a specific distance measure. The k-NCL algorithm is designed to preserve evolutionary relationships in the trees under comparison. The running time is O(n^2), where n is the size of the union of the two leaf sets. Additional properties include preservation of original distances and topology, symmetry, and uniqueness of the completion. Implemented in Python, k-NCL is evaluated on biological datasets of amphibians, birds, mammals, and sharks. Experimental results show that RF combined with k-NCL improves phylogenetic tree clustering performance compared to the RF(+) tree completion approach. Availability and implementation: An open-source implementation of k-NCL in Python and the datasets used in this study are available at https://github.com/tahiri-lab/KNCL.
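For readers unfamiliar with the RF distance mentioned in this abstract, the following is a minimal sketch of the rooted variant for the simple case where both trees already share the same leaf set. It is not the k-NCL algorithm. The nested-tuple tree encoding and the function names `clades` and `rf_distance` are assumptions made for illustration only.

```python
def clades(tree):
    """Return (non-trivial clades, all leaves) of a rooted tree.

    Assumed encoding: a tree is a nested tuple of leaf-name strings,
    e.g. (("a", "b"), ("c", "d")). Each clade is a frozenset of leaf names.
    """
    found = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])  # single leaves are trivial, not recorded
        below = frozenset().union(*(leaves_below(child) for child in node))
        found.add(below)
        return below

    all_leaves = leaves_below(tree)
    found.discard(all_leaves)  # the root clade is shared by every tree on these leaves
    return found, all_leaves


def rf_distance(tree_a, tree_b):
    """Rooted RF distance: number of clades present in exactly one tree."""
    clades_a, leaves_a = clades(tree_a)
    clades_b, leaves_b = clades(tree_b)
    if leaves_a != leaves_b:
        # This is exactly the situation that pruning or completion must resolve first.
        raise ValueError("this sketch requires identical leaf sets")
    return len(clades_a ^ clades_b)
```

Calling `rf_distance` on two trees with different leaf sets raises an error in this sketch, which is the gap that pruning or completion methods are meant to close before a distance is computed.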
Frost, H. R.
We describe an approach for analyzing biological networks using rows of the Krylov subspace of the adjacency matrix. Specifically, we explore the scenario where the Krylov subspace matrix is computed via power iteration using a non-random and potentially non-uniform initial vector that captures a specific biological state or perturbation. In this case, the rows of the Krylov subspace matrix (i.e., Krylov trajectories) carry important functional information about the network nodes in the biological context represented by the initial vector. We demonstrate the utility of this approach for community detection and perturbation analysis using the C. elegans neural network.
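As a rough illustration of the idea in this abstract, the sketch below builds a matrix whose columns are successive power-iteration vectors v, Av, A^2 v, ... and returns its rows, one trajectory per node. The per-step unit-norm rescaling, the dense list-of-lists matrix, and the name `krylov_trajectories` are assumptions for illustration and may differ from the authors' procedure.

```python
def krylov_trajectories(adjacency, start, steps):
    """Return one row per node: the node's value at each power-iteration step.

    adjacency: square matrix as a list of lists.
    start: initial vector encoding the state or perturbation of interest.
    steps: number of columns (Krylov vectors) to compute.
    """
    n = len(adjacency)
    vector = list(start)
    columns = []
    for _ in range(steps):
        # Rescale to unit Euclidean norm so entries stay comparable across steps.
        norm = sum(x * x for x in vector) ** 0.5
        if norm > 0:
            vector = [x / norm for x in vector]
        columns.append(vector)
        # Multiply by the adjacency matrix to get the next Krylov vector.
        vector = [sum(adjacency[i][j] * vector[j] for j in range(n)) for i in range(n)]
    # Transpose: row i is the trajectory of node i.
    return [[column[i] for column in columns] for i in range(n)]
```

On a three-node path graph with the perturbation placed on the first node, the trajectory of the far end stays at zero for the first two steps and becomes positive only at the third, reflecting its graph distance from the perturbed node.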
Perez, G. J. G.; Perez-Rodriguez, R.; Gonzalez, A.
Common knowledge states that the spontaneous somatic evolution of a normal tissue may lead to a tumor. Once the tumor is formed, it naturally evolves towards a state of higher malignancy. On the other hand, perfect gene expression markers for normal tissue and tumor--the so-called N-genes and T-genes--were recently introduced. We join these two pieces of knowledge in order to argue that: 1) Only N-markers participate in the spontaneous dynamics of a normal tissue. The number of active markers decreases as the tissue approaches the transition point where it becomes a tumor. 2) Only T-markers participate in the spontaneous dynamics of tumors. The number of markers increases as the tumor becomes more malignant. 3) Both sets of genes are connected by the so-called NT-genes, i.e., genes that are simultaneously N- and T-markers. They should play a crucial role at the transition point and, possibly, when the tumor is exposed to a drug or therapy. 4) The pathways or mechanisms protecting the normal tissue from becoming a tumor may be described by a small perfect panel of N-genes. 5) The pathways or mechanisms guiding the evolution of tumors in a tissue may be described by a small perfect panel of T-genes. We illustrate the above statements with the analysis of expression data for prostate adenocarcinoma, one of the most heterogeneous tumors. In this case, there are about 1000 N-genes and 6000 T-genes, and the perfect N- and T-panels contain 11 and 8 genes, respectively. Additionally, we provide examples from lung adenocarcinoma and liver hepatocarcinoma.
Milkey, A.; Lewis, P. O.
A new Bayesian measure of phylogenetic information content is introduced based on geodesic distances in treespace. The measure is based on the relative variance of phylogenetic trees sampled from the posterior distribution compared to the prior distribution. This ratio is expected to equal 1 if there is no information in the data about phylogeny and 0 if there is complete information. Trees can be scaled to have the same mean tree length to avoid dominance by edge length information and focus on topological information. The method scales well, requiring only that a valid sample can be obtained from both prior and posterior distributions. We show how dissonance (information conflict) among data sets can also be estimated. Both simulated and empirical examples are provided to illustrate that the new approach produces sensible and intuitive results.
Revell, L. J.; Alencar, L. R. V.; Alfaro, M. E.; Dain, J.; Hill, N. J.; Jones, M.; Martinet, K. M.; Romero-Alarcon, V.; Harmon, L. J.
The practical utility of many modern phylogenetic comparative methods can depend on how accurately mathematical models capture the evolutionary process of traits. Boucher and Demery (2016) described a new quantitative trait model, Brownian motion with reflective limits, that they anticipated might be of use in testing hypotheses about a particular sort of constraint on phenotypic character evolution. Since their analytic solution for the probability function under this bounded evolutionary scenario was not practical to evaluate for reasonably-sized trees, Boucher and Demery (2016) also identified a creative technique for computing the likelihood of their model. The basis of this methodology derives from the convergence of an equal-rates, symmetric, ordered Markov chain and continuous stochastic diffusion in the limit as the number of steps in our chain goes to infinity (or, alternatively, as their widths decrease towards zero). We refer to this convergence in the limit as the discretized diffusion approximation or (more compactly) the discrete approximation. We realized that this discrete approximation of Boucher and Demery (2016) unlocked a number of additional models for the phylogenetic comparative analysis of discrete and continuous trait data, and we explore several of these in the present article. Specifically, we examine application of this discretized diffusion approximation to the threshold model from evolutionary quantitative genetics, to a new "semi-threshold" trait evolution model, to a joint model of discrete and continuous traits in which the discrete trait influences the rate of evolution of our continuous character, as well as a model where precisely the converse is true, and to a discrete character dependent multi-trend trended continuous trait evolution model. We conclude with some context for the origins of our article and discussion of other possible applications of this powerful approach.
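The discretized diffusion approximation described in this abstract can be sketched as building the rate matrix of an ordered, symmetric, equal-rates Markov chain over bins of the trait interval. The sketch below assumes the standard random-walk scaling, in which a nearest-neighbour rate r and bin width d give variance 2·r·d² per unit time, so r = σ²/(2d²). The function name `bounded_bm_rate_matrix` and its parameters are hypothetical and are not taken from the authors' software.

```python
def bounded_bm_rate_matrix(sigma2, lower, upper, n_bins):
    """Rate matrix Q of an ordered, symmetric Markov chain approximating
    Brownian motion with rate sigma2 and reflective limits [lower, upper]."""
    if n_bins < 2 or upper <= lower:
        raise ValueError("need at least two bins and upper > lower")
    width = (upper - lower) / n_bins
    # Matching the diffusion variance per unit time: sigma2 = 2 * rate * width**2.
    rate = sigma2 / (2.0 * width ** 2)
    q = [[0.0] * n_bins for _ in range(n_bins)]
    for i in range(n_bins):
        # Only transitions to adjacent bins are allowed (ordered chain).
        for j in (i - 1, i + 1):
            if 0 <= j < n_bins:
                q[i][j] = rate
        # End bins have a single neighbour, which acts as a reflecting boundary.
        q[i][i] = -sum(q[i])
    return q
```

Transition probabilities along a branch of length t would then be obtained from the matrix exponential of Q·t, for example with `scipy.linalg.expm`, and the approximation is expected to improve as `n_bins` grows.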
Ivan, J.; Lanfear, R.
Many phylogenomic studies have used non-overlapping windows to address gene tree discordance across a set of aligned genomes. Recently, Ivan et al. (2025) proposed an information theoretic approach to choose an optimal window size given the alignment. However, this approach selects only a single fixed window size per chromosome, which is a useful first step but fails to account for variation in the size of non-recombining regions along each chromosome. Such variation is expected to occur due to the stochastic nature of recombination as well as the variation in recombination rates along chromosomes. In this study, we extend the approach of Ivan et al. (2025) to allow window sizes to vary across the chromosome, using a splitting-and-merging strategy that allows for each window to be of an arbitrary length. We showed that the new method outperformed the fixed-window approach in recovering gene tree topologies on a wide range of simulated datasets. Applying the new method on the genomes of seven Heliconius butterflies, we found that the average window sizes for the group ranged from 538 to 808 bp, but with a very similar distribution of gene tree topologies compared to previous studies that used fixed window sizes. For the genomes of great apes, the average window sizes ranged from 4.2 kb to 6.2 kb, with the proportion of the major topology (i.e., grouping human and chimpanzee together) reaching approximately 80%. In conclusion, our study highlights the limitations of using a fixed window size when recombination rates vary across the chromosomes, and proposes a splitting-and-merging approach that allows for variable window sizes across whole genome alignments.
Marchand, B.; Tahiri, N.; Tremblay-Savard, O.; Lafond, M.
Phylogenetic networks are widespread representations of evolutionary histories for taxa that undergo hybridization or Lateral-Gene Transfer (LGT) events. There are now many tools to reconstruct such networks, but no clearly established metric to compare them. Such metrics are needed, for example, to evaluate predictions against a simulated ground truth. Despite years of effort in developing metrics, known dissimilarity measures either do not distinguish all pairs of different networks, or are extremely difficult to compute. Since it appears challenging, if not impossible, to create the ideal metric for all classes of networks, it may be relevant to design them for specialized applications. In this article, we introduce a metric on LGT networks, which consist of trees with additional arcs that represent lateral gene transfer events. Our metric is based on edit operations, namely the addition/removal of transfer arcs, and the contraction/expansion of arcs of the base tree, allowing it to connect the space of all LGT networks. We show that it is linear-time computable if the order of transfers along a branch is unconstrained but NP-hard otherwise, in which case we provide a fixed-parameter tractable (FPT) algorithm in the level. We implemented our algorithms and demonstrate their applicability on three numerical experiments. Full online version: https://www.biorxiv.org/content/10.1101/2025.11.20.689557
Parmigiani, L.; Peterlongo, P.
A pangenome is a collection of taxonomically related genomes, often from the same species, serving as a representation of their genomic diversity. The study of pangenomes, or pangenomics, aims to quantify and compare this diversity, which has significant relevance in fields such as medicine and biology. Originally conceptualized as sets of genes, pangenomes are now commonly represented as pangenome graphs. These graphs consist of nodes representing genomic sequences and edges connecting consecutive sequences within a genome. Among possible pangenome graphs, a common option is the compacted de Bruijn graph. In our work, we focus on the colored compacted de Bruijn graph, where each node is associated with a set of colors that indicate the genomes traversing it. In response to the evolution of pangenome representation, we introduce a novel method for comparing pangenomes by their node counts, addressing two main challenges: the variability in node counts arising from graphs constructed with different numbers of genomes, and the large influence of rare genomic sequences. We propose an approach for interpolating and extrapolating node counts in colored compacted de Bruijn graphs, adjusting for the number of genomes. To tackle the influence of rare genomic sequences, we apply Hill numbers, a well-established diversity index previously utilized in ecology and metagenomics for similar purposes, to proportionally weight both rare and common nodes according to the frequency of genomes traversing them.
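For context, the Hill number of order q for proportions p_i is (Σ p_i^q)^(1/(1−q)), with the q → 1 limit equal to the exponential of the Shannon entropy. The sketch below computes it from a list of abundances. How node counts of a colored compacted de Bruijn graph are turned into abundances is not specified in this abstract, so the input list here is a placeholder, and `hill_number` is an illustrative name.

```python
import math


def hill_number(abundances, q):
    """Hill number (effective number of types) of order q.

    q = 0 counts all types equally, q = 1 weights types by their proportion,
    and larger q increasingly emphasises the common types.
    """
    total = sum(abundances)
    if total <= 0:
        raise ValueError("abundances must have a positive sum")
    proportions = [a / total for a in abundances if a > 0]
    if q == 1:
        # Limit case: exponential of Shannon entropy.
        return math.exp(-sum(p * math.log(p) for p in proportions))
    return sum(p ** q for p in proportions) ** (1.0 / (1.0 - q))
```

With perfectly even abundances every order returns the number of types, while for skewed abundances the value drops as q increases, which is how rare items are down-weighted.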
Fletcher, W. L.; Sinha, S.
The practices of identifying biomarkers and developing prognostic models using genomic data have become increasingly prevalent. Such data often feature characteristics that make these practices difficult, namely high dimensionality, correlations between predictors, and sparsity. Many modern methods have been developed to address these problematic characteristics while performing feature selection and prognostic modeling, but a large-scale comparison of their performances in these tasks on diverse right-censored time to event data (aka survival time data) is much needed. We have compiled many existing methods, including some machine learning methods, several of which have performed well in previous benchmarks, primarily for comparison with regard to variable selection capability, and secondarily for survival time prediction on many synthetic datasets with varying levels of sparsity, correlation between predictors, and signal strength of informative predictors. For illustration, we have also performed multiple analyses on a publicly available and widely used cancer cohort from The Cancer Genome Atlas using these methods. We evaluated the methods through extensive simulation studies in terms of the false discovery rate, F1-score, concordance index, Brier score, root mean square error, and computation time. Of the methods compared, CoxBoost and the Adaptive LASSO performed well in all metrics, and the LASSO and elastic net excelled when evaluating concordance index and F1-score. The Benjamini-Hochberg and q-value procedures showed volatile performances in controlling the false discovery rate. Some methods' performances were greatly affected by differences in the data characteristics. With our extensive numerical study, we have identified the best performing methods for a plethora of data characteristics using informative metrics. This will help cancer researchers in choosing the best approach for their needs when working with genomic data.
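As background for the false discovery rate control mentioned in this abstract, here is a minimal sketch of the standard Benjamini-Hochberg step-up procedure applied to a list of p-values. It is a generic textbook version and not the benchmarking code of the study. The function name and the default `alpha` are assumptions.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value is under its threshold.
        if p_values[index] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for index in order[:cutoff_rank]:
        rejected[index] = True
    return rejected
```

Because the rule keeps the largest qualifying rank, a p-value can be rejected even when it exceeds its own threshold, as long as a larger p-value further down the sorted list meets its threshold.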
Jackson, K. C.; Carilli, M. T.; Pachter, L.
Contrastive principal component analysis (PCA) methods are effective approaches to dimensionality reduction where variance of a target dataset is maximized while variance of a background dataset is minimized. We previously described how contrastive PCA problems can be written as solutions to generalized eigenvalue problems that maximize particular instantiations of the Rayleigh quotient. Here, we discuss two extensions of contrastive PCA: we use kernel weighting from spatial PCA (k-ρPCA) to contrast spatial and non-spatial axes of variation, and separately solve the Rayleigh quotient in the space of basis function coefficients (f-ρPCA) to find modes of variation in functional data. Together, these extensions expand the scope of contrastive PCA while unifying disparate fields of spatial and functional methods within a single conceptual and mathematical framework. We showcase the utility of these extensions with several examples drawn from genomics, analyzing gene expression in cancer and immune response to vaccination.
Rinon, E. M.; Visaya, M. V.; Sambayan, R.
Kernel methods offer a robust framework for integrating multi-modal datasets into a unified representation, thereby facilitating more comprehensive data interpretation. In the presence of incomplete datasets, multiple kernel learning is employed to enhance the efficiency of data completion and integration. We investigate kernel-based approaches to address the incomplete-data problem with applications to yeast protein data. Biological data such as yeast proteins can be represented through multiple modalities, including gene expression profiles, amino acid sequences, three-dimensional structures, and protein interaction networks. We introduce a computational pipeline based on kernel matrix completion, in which topological data analysis (TDA) and persistent spectral analysis are incorporated into the classification setting. TDA captures geometric structure across scales while spectral descriptors reflect connectivity patterns through Laplacian eigenvalues. Kernel, topological, and spectral descriptors are used with support vector machines to discriminate between membrane and non-membrane yeast proteins. Empirical results show that the combined pipeline improves both kernel completion accuracy and ROC performance relative to baseline kernel-only approaches. The best-performing configuration achieves an ROC score of 0.8632 using the average of three kernels augmented with TDA features. These results demonstrate competitive performance relative to strong kernel-based baselines under incomplete data conditions. The proposed approach provides a unified approach for learning from incomplete heterogeneous data while enriching kernel representations with geometric and spectral information.
Ren, H.; Jiang, C.; Wong, T. K. F.; Shao, Y.; Susko, E.; Minh, B. Q.; Lanfear, R.
Partitioned and mixture models are widely employed in Maximum Likelihood phylogenetic analyses of large genomic datasets. Comparing the fit of the two types of models has been challenging, because standard information-theoretic approaches cannot be applied. Mixture models are increasingly popular for the analysis of amino acid datasets and can lead to different conclusions compared to partitioned models. This raises an important question - which type of model tends to perform better? Susko et al. (2026) recently introduced the marginal Akaike information criterion (mAIC), which allows mixture models and partitioned models to be directly compared for the first time. Here, we use the mAIC and a range of other approaches to compare the fit of mixture and partitioned models across a diverse set of empirical datasets. We show that mixture models are universally favoured on amino acid datasets. This has important implications for interpreting empirical analyses and suggests that continued development of mixture models is an important avenue for future research.
Nagel, A. A.; Landis, M. J.
Ancestral state reconstruction is a classical problem of broad relevance in phylogenetics. Likelihood-based methods for reconstructing ancestral states under discrete character models, such as Markov models, have proven extremely useful, but only work so long as the assumed model yields a tractable likelihood function. Unfortunately, extending a simple but tractable phylogenetic model to possess new, but biologically realistic, properties often results in an intractable likelihood, preventing its use in standard modeling tasks, including ancestral state reconstruction. The rapid advancement of deep learning offers a potential alternative to likelihood-based inference of ancestral states, particularly for models with intractable likelihoods. In this study, we modify the phylogenetic deep learning software phyddle to conduct ancestral state reconstruction. We evaluate phyddle's performance under various methodological and modeling conditions, while comparing to Bayesian inference when possible. For simple models and small trees, its performance resembles the performance of Bayesian inference, but worsens as tree size increases. While phyddle still performs adequately for more complex models, such as speciation and extinction models, the estimates differ more from Bayesian inference in comparison with simpler models. Lastly, we use phyddle to infer ancestral states for two empirical datasets, one of the ancestral ranges of a subclade of the genus Liolaemus and ancestral locations for sequences from the 2014 Sierra Leone Ebola virus disease outbreak.
Vasylenko, L.; Livnat, A.
At the fundamental conceptual level, two alternatives have traditionally been considered for how mutations arise and how evolution happens: 1) random mutation and natural selection, and 2) Lamarckism. Recently, the theory of Interaction-based Evolution (IBE) has been proposed, according to which mutations are neither random nor Lamarckian, but are influenced by information accumulating internally in the genome over generations. Based on the estimation-of-distribution algorithms framework, we present a simulation model that demonstrates nonrandom, non-Lamarckian mutation concretely while capturing indirectly several aspects of IBE: selection, recombination, and nonrandom, non-Lamarckian mutation interact in a complementary fashion; evolution is driven by the interaction of parsimony and fit; and random bits do not directly encode improvement but enable generalization by the manner in which they connect with the rest of the evolutionary process. Connections are drawn to Darwin's observations that changed conditions increase the rate of production of heritable variation; to the causes of bell-shaped distributions of traits and how these distributions respond to selection; and to computational learning theory, where analogizing evolution to learning in accord with IBE casts individuals as examples and places the learned hypothesis at the population level. The model highlights the importance of incorporating internal integration of information through heritable change in both evolutionary theory and evolutionary computation.
Szmigiel, A.; Gesteira Costa Filho, I.; Campello, R. J. G. B.
Clustering single-cell RNA-seq (scRNA-seq) data and related protocols remains a major challenge due to high dimensionality, sparsity, and noise. Despite numerous benchmarking studies aiming to identify the most suitable clustering methods, many suffer from methodological flaws that can undermine their conclusions. A major challenge in benchmarking is selecting representative datasets that cover the diversity of scRNA-seq experiments and include laboratory-verified labels for reliable evaluation. Consistent preprocessing of all inputs to benchmarked algorithms is crucial, as it significantly impacts performance. Beyond selecting an algorithm, a thorough exploration of hyperparameters is also essential to assess robustness and identify configurations that maximize performance. We focus on proposing an improved benchmarking framework that addresses common methodological issues in prior studies. We illustrate our proposed methodology in a case study comparing the classic Leiden and Louvain clustering algorithms with extensive hyperparameter exploration on a carefully curated collection of real gold standard datasets. By evaluating clustering performance across different hyperparameter selection scenarios, we show that benchmarking results can be misleading, either overestimating or underestimating performance depending on how the hyperparameter space is explored. In our illustrative case study, benchmarking results do not reveal any practically relevant performance differences between the Louvain and Leiden algorithms. In contrast, we show that overlooked factors such as graph construction and quality functions critically influence clustering outcomes, particularly under suboptimal settings of numerical hyperparameters: the neighborhood size k used for similarity graph construction and the resolution hyperparameter in graph-based clustering algorithms.
While noticeable trends have been observed in terms of how different (dis)similarity functions affect performance, the impact of this choice is limited and, to some extent, overridden by the graph-building approach. Across different graphs, there is a noticeable trade-off between achieving optimal performance with ideally tuned numerical hyperparameters and maintaining robustness under more realistic, unsupervised, and suboptimal settings. All in all, the analysis of our illustrative benchmarking case study offers clear guidance and objective recommendations for practitioners in the field. Most importantly, as the main contribution of this manuscript, our proposed framework sets a foundation for more reliable scRNA-seq clustering evaluation and benchmarking in future studies.
Berv, J. S.; Fox, N.; Thorstensen, M. J.; Lloyd-Laney, H.; Troyer, E. M.; Rivero-Vega, R. A.; Smith, S. A.; Friedman, M.; Fouhey, D. F.; Weeks, B. C.
High-dimensional comparative datasets, including geometric morphometric landmarks, functional traits, and other large trait datasets, are increasingly common in biology. When these datasets include a large number of traits relative to the number of taxa, they pose significant challenges for phylogenetic comparative analysis. In addition, evolutionary dynamics are often heterogeneous across phylogenies, challenging researchers to develop tools that can localize and account for such variation when investigating hypotheses of evolutionary change. We present bifrost, an R package for detecting and characterizing shifts in multivariate trait evolution across phylogenetic trees. bifrost implements a stepwise greedy search over alternative macroevolutionary regime configurations on a phylogeny. Candidate shifts are proposed and assessed at internal nodes, accelerated with parallel model fitting where possible, and aggregated sequentially when they exceed a user-defined information-criterion acceptance threshold. The underlying model is a scalar-rate multivariate Brownian motion process fit by generalized least squares using mvMORPH::mvgls [1]. Our framework also provides support estimates for individual shifts using information-criterion weights. We illustrate the workflow using a fossil-tip-dated phylogeny and high-dimensional landmark data for early bony fish jaws (32,508 scalar coordinate values), and discuss tuning, outputs, and limitations. bifrost extends existing phylogenetic comparative frameworks for evolutionary analysis and provides a scalable pipeline for exploring the phylogenetic natural history of large multivariate datasets.
King, B.
Simulation-based calibration (SBC) checking is a method to ensure that the inference machinery for a Bayesian statistical analysis is functioning in a correct and unbiased manner. Typically, SBC begins with sampling parameter values from the model priors (prior SBC). However, it has been shown that prior SBC can miss problems when these manifest only in certain regions of parameter space. In phylogenetics, this is relevant not only because of the vastness of tree and parameter space, but also because many phylogenetic analyses involve some degree of model misspecification. Posterior SBC is a recently developed method for checking that the inference algorithms function correctly for a given empirical dataset. Here I use posterior SBC to test the implementation of phylogenetic dating methods in the inference software BEAST 2. I test both the tip-dated approach, employing an Indo-European vocabulary dataset, and the node-dated approach, employing a molecular rRNA dataset of Tabanidae (horseflies). In both cases, posterior SBC tests indicate good calibration. Despite this, posterior predictive datasets simulated from the posterior distribution provided no further increase in the precision of node age estimates compared to the original posterior, a result consistent with previous literature showing fundamental theoretical limits to the identifiability of node ages. Nevertheless, these results suggest that phylogenetic dating methods in BEAST 2 are not biased by problems with the inference machinery, thereby increasing confidence in results obtained using these methods.
Garay, J.; Mori, T. F.
Price equation and genotype dynamics are two methods for studying the fixation of one allele by natural selection in a diploid population. There are two strict monotonicity conditions that imply the fixation of one allele. The genotype dynamics is called Haldane monotone if the relative frequency of one allele strictly increases along all solutions of the genotype dynamics, so this allele is fixed. In this paper, we show that the genotype dynamics is Haldane monotone if and only if the right-hand side of the Price equation is always strictly positive. The other strict monotonicity condition requires that the relative frequency of a homozygote strictly increase according to the genotype dynamics. For example, in a model where the genotype dynamics is governed by interactions between individuals, the cost-accepting homozygote is fixed by natural selection if the other genotypes always receive a smaller average gain from all interactions than the cost-accepting homozygote. Both monotonicity conditions require that the interaction is not well-mixed in the population. These two conditions are not equivalent. In addition, we give a non-monotonicity condition, which also implies the fixation of a homozygote. The fixation of a homozygote depends on the phenotypic payoff of the interaction, the genotype-phenotype mapping, and the interaction scheme. In a sexual population, the interaction scheme of siblings depends on the mating system, and so do the conditions of fixation of the cost-accepting homozygote. We present examples showing that if we only change the monogamous mating system, assuming panmixing or mating assortativity, then the condition for the fixation of the cooperator homozygote is b > 2c and b > c, respectively.
Shur, A.; Tziony, I.; Orenstein, Y.
Minimizers are sampling schemes which are ubiquitous in almost any high-throughput sequencing analysis. Assuming a fixed alphabet of size σ, a minimizer is defined by two positive integers k, w and a linear order ρ on k-mers. A sequence is processed by a sliding window algorithm that chooses in each window of length w + k - 1 its minimal k-mer with respect to ρ. A key characteristic of a minimizer is its density, which is the expected frequency of chosen k-mers among all k-mers in a random infinite σ-ary sequence. Minimizers of smaller density are preferred as they produce smaller samples, which lead to reduced runtime and memory usage in downstream applications. Recent studies developed methods to generate minimizers with optimal and near-optimal densities, but they require explicitly storing k-mer ranks in Ω(2^k) space. While constant-space minimizers exist, and some of them are proven to be asymptotically optimal, no constant-space minimizer was proven to guarantee lower density compared to a random minimizer in the non-asymptotic regime, and many minimizer schemes suffer from long k-mer key-retrieval times due to complex computation. In this paper, we introduce 10-minimizers, which constitute a class of minimizers with promising properties. First, we prove that for every k > 1 and every w ≥ k - 2, a random 10-minimizer has, in expectation, lower density than a random minimizer. This is the first provable guarantee for a class of minimizers in the non-asymptotic regime. Second, we present spacers, which are particular 10-minimizers combining three desirable properties: they are constant-space, low-density, and have small k-mer key-retrieval time. In terms of density, spacers are competitive with the best known constant-space minimizers; in certain (k, w) regimes they achieve the lowest density among all known (not necessarily constant-space) minimizers.
Notably, we are the first to benchmark constant-space minimizers in the time spent for k-mer key retrieval, which is the most fundamental operation in many minimizers-based methods. Our empirical results show that spacers can retrieve k-mer keys in competitive time (a few seconds per genome-size sequence, which is less than required by random minimizers), for all practical values of k and w. We expect 10-minimizers to improve minimizers-based methods, especially those using large window sizes. We also propose the k-mer key-retrieval benchmark as a standard objective for any new minimizer scheme.
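The sliding-window definition of a minimizer given in this abstract can be sketched as follows. This is the generic scheme only, not the 10-minimizers or spacers introduced by the authors. The `rank` argument stands in for the order ρ and defaults to lexicographic order, tie-breaking to the leftmost k-mer is an assumption, and both function names are illustrative.

```python
def minimizer_positions(sequence, k, w, rank=None):
    """Start positions of the k-mers selected by a (k, w) minimizer.

    Each window holds w consecutive k-mers, i.e. w + k - 1 characters,
    and contributes its minimal k-mer under the order given by `rank`.
    """
    if rank is None:
        rank = lambda kmer: kmer  # lexicographic order as a stand-in for rho
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # min() returns the first minimal element, so ties go to the leftmost k-mer.
        selected.add(min(window, key=lambda i: rank(kmers[i])))
    return sorted(selected)


def empirical_density(sequence, k, w, rank=None):
    """Fraction of k-mer positions in `sequence` that are selected."""
    n_kmers = len(sequence) - k + 1
    if n_kmers <= 0:
        raise ValueError("sequence is shorter than k")
    return len(minimizer_positions(sequence, k, w, rank)) / n_kmers
```

The density defined in the abstract is an expectation over random infinite sequences, so `empirical_density` on one finite sequence is only an estimate of it. Passing a different `rank` function, for example one based on a hash of the k-mer, changes which k-mers are sampled and therefore the density.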
Ane, C.; Bastide, P.
Most phylogenetic comparative methods use a species-level phylogeny, ignoring the effect of incomplete lineage sorting (ILS) and hemiplasy on the traits of interest. We consider here a trait controlled additively by one or more unknown loci. Their gene trees may differ from the species phylogeny due to ILS, as modeled by the coalescent process. If the species phylogeny is a network, this process also accounts for gene flow, admixture or hybridization. Our model allows for polymorphism in the ancestral population at the root of the species phylogeny, and predicts heritable within-population variation due to ILS. Even if each locus evolves according to a Brownian motion, the joint distribution of all trait measurements is not generally Gaussian due to ILS. We provide a Gaussian approximation, named the Gaussian Coalescent, and show how to compute its variance matrix efficiently using a single traversal of the species phylogeny. In simulations, this model is much more accurate than the model ignoring ILS. In simulations and on a data set of tomato floral traits, it is favored over the standard Brownian motion model with extra within-population variance. The GC model opens new avenues for various phylogenetic comparative methods, accounting for hemiplasy and gene flow simultaneously. It is implemented in phylolm v2.7.0 and in PhyloTraits v1.2.0.